{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Model Training and Validation Pipeline\n",
    "Now that you have created features, you can use them to train one or more models. In this section, you will generate feature vectors with multiple features from one or more feature sets and feed them into an automated ML training and testing pipeline to create high-quality models.\n",
    "\n",
    "The ML pipeline can be triggered and tracked manually during the interactive devel‐ opment, or it can be saved (into Git) and be executed automatically on a given schedule or as a reaction to different events (such as code modification, CI/CD, data changes, model drift, and so on). See [MLRun project and CI/CD documentation](https://docs.mlrun.org/en/stable/projects/project.html) for details.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Saving and loading projects from GIT\n",
    "\n",
    "After you saved your project and its elements (functions, workflows, artifacts, etc.) you can commit all your changes to a \n",
    "GIT repository. This can be done using standard GIT tools or using MLRun `project` methods such as `pull`, `push`, \n",
    "`remote`, which calls the Git API for you.\n",
    "\n",
    "Projects can then be loaded from Git using MLRun `load_project` method, for example: \n",
    "\n",
    "    project = mlrun.load_project(\"./myproj\", \"git://github.com/mlrun/project-demo.git\", name=project_name)\n",
    "    \n",
    "or using MLRun CLI:\n",
    "\n",
    "    mlrun project -n myproj -u \"git://github.com/mlrun/project-demo.git\" ./myproj\n",
    "    \n",
    "Read [CI/CD integration](../../projects/ci-integration.html) for more details."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 364,
   "metadata": {},
   "outputs": [],
   "source": [
    "project = mlrun.load_project(\n",
    "    name=\"fraud-demo\",\n",
    "    context=\"./\",\n",
    "    user_project=True,\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Creating and Evaluating a Feature Vector\n",
    "\n",
    "Models are trained with multiple features, which can arrive from different feature sets and be collected into training (feature) vectors. Feature stores know how to correctly combine the features into a vector by implementing smart JOINs and assessing the time dimension (time traveling).\n",
    "To define a feature vector, you need to specify a name, the list of features it contains, the target features (labels), and other optional parameters. Features are specified as `<FeatureSet>.<Feature> or <FeatureSet>.*`  (all the features in a feature set). The following part demonstrates how to create and use a feature vector.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create a feature vector"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 365,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import MLRun's Feature Store\n",
    "import mlrun.feature_store as fstore\n",
    "\n",
    "# Define the list of features to use\n",
    "features = ['events.*',\n",
    "            'transactions.amount_max_2h', \n",
    "            'transactions.amount_sum_2h', \n",
    "            'transactions.amount_count_2h',\n",
    "            'transactions.amount_avg_2h', \n",
    "            'transactions.amount_max_12h', \n",
    "            'transactions.amount_sum_12h',\n",
    "            'transactions.amount_count_12h', \n",
    "            'transactions.amount_avg_12h', \n",
    "            'transactions.amount_max_24h',\n",
    "            'transactions.amount_sum_24h', \n",
    "            'transactions.amount_count_24h', \n",
    "            'transactions.amount_avg_24h',\n",
    "            'transactions.es_transportation_sum_14d', \n",
    "            'transactions.es_health_sum_14d',\n",
    "            'transactions.es_otherservices_sum_14d', \n",
    "            'transactions.es_food_sum_14d',\n",
    "            'transactions.es_hotelservices_sum_14d', \n",
    "            'transactions.es_barsandrestaurants_sum_14d',\n",
    "            'transactions.es_tech_sum_14d', \n",
    "            'transactions.es_sportsandtoys_sum_14d',\n",
    "            'transactions.es_wellnessandbeauty_sum_14d', \n",
    "            'transactions.es_hyper_sum_14d',\n",
    "            'transactions.es_fashion_sum_14d', \n",
    "            'transactions.es_home_sum_14d', \n",
    "            'transactions.es_travel_sum_14d', \n",
    "            'transactions.es_leisure_sum_14d',\n",
    "            'transactions.gender_F',\n",
    "            'transactions.gender_M',\n",
    "            'transactions.step', \n",
    "            'transactions.amount', \n",
    "            'transactions.timestamp_hour',\n",
    "            'transactions.timestamp_day_of_week']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create a feature vector"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 366,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Define the feature vector name for future reference\n",
    "fv_name = 'transactions-fraud'\n",
    "\n",
    "# Define the feature vector using the feature store (fstore)\n",
    "transactions_fv = fstore.FeatureVector(fv_name, \n",
    "                          features, \n",
    "                          label_feature=\"labels.label\",\n",
    "                          description='Predicting a fraudulent transaction')\n",
    "\n",
    "# Save the feature vector in the feature store\n",
    "transactions_fv.save()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once you have defined the feature vector, you can use `get_offline_features()` to generate the vector dataset and return it as a dataframe or materialize it into a file (CSV or Parquet). The next part demonstrates how to retrieve a vector, materialize it, and view its results."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Building and Running an Automated Training and Validation Pipeline\n",
    "\n",
    "MLRun allows the building of distributed ML pipelines that can handle data processing, automated feature selection, training, optimization, testing, deployments, and so on. Pipelines are composed of steps that run or deploy custom or library (from the MLRun hub) serverless functions. Pipelines can be run locally (for debugging or small-scale tasks), on a scalable Kubernetes cluster (using Kubeflow), or in a CI/CD system.\n",
    "\n",
    "The example consists of the following pipeline steps (all using pre-defined MLRun hub functions):\n",
    "\n",
    "1. Materialize a feature vector (using `hub://get_offline_features`). \n",
    "2. Select the most optimal features (using `hub://feature_selection`).\n",
    "3. Train the model with multiple algorithms (using `hub://auto_trainer`).\n",
    "4. Evaluate the model (using `hub://auto_trainer`).\n",
    "5. Deploy the model and its application to the test cluster (using `hub://v2_model_server`). The next section will explain the model and application pipeline in detail.\n",
    "\n",
    "Each step can accept the previous steps’ results or data, and generate results, multiple visual artifacts/charts, versioned data objects, and registered models.\n",
    "\n",
    "We have defined the workflow in [`src/new_train_workflow.py`](./src/new_train_workflow.py). "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Running the ML pipeline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The workflow/pipeline can be executed using the MLRun SDK (`project.run()` method) or using CLI commands (mlrun project), and can run directly from the source repo (GIT). See details in [MLRun Projects and Automation documentation](https://docs.mlrun.org/en/stable/projects/project.html).\n",
    "\n",
    "You can set arguments and destinations for the different artifacts when you run the workflow. The pipeline progress and results are shown in the notebook. Alternatively, you can check the progress, logs, artifacts, and more, in the MLRun UI or the CI/CD system. The next part demonstrates how to run the pipeline with custom arguments using the SDK."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 368,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>Pipeline running (id=2ccacd0b-7bc6-42c6-86e7-7569d688b8cc), <a href=\"https://dashboard.default-tenant.app.llm2.iguazio-cd0.com/mlprojects/fraud-demo-pengw/jobs/monitor-workflows/workflow/2ccacd0b-7bc6-42c6-86e7-7569d688b8cc\" target=\"_blank\"><b>click here</b></a> to view the details in MLRun UI</div>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "image/svg+xml": [
       "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>\n",
       "<!DOCTYPE svg PUBLIC \"-//W3C//DTD SVG 1.1//EN\"\n",
       " \"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd\">\n",
       "<!-- Generated by graphviz version 2.43.0 (0)\n",
       " -->\n",
       "<!-- Title: kfp Pages: 1 -->\n",
       "<svg width=\"248pt\" height=\"260pt\"\n",
       " viewBox=\"0.00 0.00 248.05 260.00\" xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\">\n",
       "<g id=\"graph0\" class=\"graph\" transform=\"scale(1 1) rotate(0) translate(4 256)\">\n",
       "<title>kfp</title>\n",
       "<polygon fill=\"white\" stroke=\"transparent\" points=\"-4,4 -4,-256 244.05,-256 244.05,4 -4,4\"/>\n",
       "<!-- fraud&#45;detection&#45;pipeline&#45;wsnvk&#45;1055580446 -->\n",
       "<g id=\"node1\" class=\"node\">\n",
       "<title>fraud&#45;detection&#45;pipeline&#45;wsnvk&#45;1055580446</title>\n",
       "<polygon fill=\"green\" stroke=\"black\" points=\"122,-36 4,-36 0,-32 0,0 118,0 122,-4 122,-36\"/>\n",
       "<polyline fill=\"none\" stroke=\"black\" points=\"118,-32 0,-32 \"/>\n",
       "<polyline fill=\"none\" stroke=\"black\" points=\"118,-32 118,0 \"/>\n",
       "<polyline fill=\"none\" stroke=\"black\" points=\"118,-32 122,-36 \"/>\n",
       "<text text-anchor=\"middle\" x=\"61\" y=\"-14.3\" font-family=\"Times,serif\" font-size=\"14.00\">deploy&#45;serving</text>\n",
       "</g>\n",
       "<!-- fraud&#45;detection&#45;pipeline&#45;wsnvk&#45;2101587368 -->\n",
       "<g id=\"node2\" class=\"node\">\n",
       "<title>fraud&#45;detection&#45;pipeline&#45;wsnvk&#45;2101587368</title>\n",
       "<ellipse fill=\"green\" stroke=\"black\" cx=\"125\" cy=\"-234\" rx=\"57.69\" ry=\"18\"/>\n",
       "<text text-anchor=\"middle\" x=\"125\" y=\"-230.3\" font-family=\"Times,serif\" font-size=\"14.00\">get&#45;vector</text>\n",
       "</g>\n",
       "<!-- fraud&#45;detection&#45;pipeline&#45;wsnvk&#45;2908094185 -->\n",
       "<g id=\"node3\" class=\"node\">\n",
       "<title>fraud&#45;detection&#45;pipeline&#45;wsnvk&#45;2908094185</title>\n",
       "<ellipse fill=\"green\" stroke=\"black\" cx=\"125\" cy=\"-162\" rx=\"89.08\" ry=\"18\"/>\n",
       "<text text-anchor=\"middle\" x=\"125\" y=\"-158.3\" font-family=\"Times,serif\" font-size=\"14.00\">feature&#45;selection</text>\n",
       "</g>\n",
       "<!-- fraud&#45;detection&#45;pipeline&#45;wsnvk&#45;2101587368&#45;&gt;fraud&#45;detection&#45;pipeline&#45;wsnvk&#45;2908094185 -->\n",
       "<g id=\"edge1\" class=\"edge\">\n",
       "<title>fraud&#45;detection&#45;pipeline&#45;wsnvk&#45;2101587368&#45;&gt;fraud&#45;detection&#45;pipeline&#45;wsnvk&#45;2908094185</title>\n",
       "<path fill=\"none\" stroke=\"black\" d=\"M125,-215.7C125,-207.98 125,-198.71 125,-190.11\"/>\n",
       "<polygon fill=\"black\" stroke=\"black\" points=\"128.5,-190.1 125,-180.1 121.5,-190.1 128.5,-190.1\"/>\n",
       "</g>\n",
       "<!-- fraud&#45;detection&#45;pipeline&#45;wsnvk&#45;3684411306 -->\n",
       "<g id=\"node5\" class=\"node\">\n",
       "<title>fraud&#45;detection&#45;pipeline&#45;wsnvk&#45;3684411306</title>\n",
       "<ellipse fill=\"green\" stroke=\"black\" cx=\"125\" cy=\"-90\" rx=\"33.29\" ry=\"18\"/>\n",
       "<text text-anchor=\"middle\" x=\"125\" y=\"-86.3\" font-family=\"Times,serif\" font-size=\"14.00\">train</text>\n",
       "</g>\n",
       "<!-- fraud&#45;detection&#45;pipeline&#45;wsnvk&#45;2908094185&#45;&gt;fraud&#45;detection&#45;pipeline&#45;wsnvk&#45;3684411306 -->\n",
       "<g id=\"edge2\" class=\"edge\">\n",
       "<title>fraud&#45;detection&#45;pipeline&#45;wsnvk&#45;2908094185&#45;&gt;fraud&#45;detection&#45;pipeline&#45;wsnvk&#45;3684411306</title>\n",
       "<path fill=\"none\" stroke=\"black\" d=\"M125,-143.7C125,-135.98 125,-126.71 125,-118.11\"/>\n",
       "<polygon fill=\"black\" stroke=\"black\" points=\"128.5,-118.1 125,-108.1 121.5,-118.1 128.5,-118.1\"/>\n",
       "</g>\n",
       "<!-- fraud&#45;detection&#45;pipeline&#45;wsnvk&#45;2543368571 -->\n",
       "<g id=\"node4\" class=\"node\">\n",
       "<title>fraud&#45;detection&#45;pipeline&#45;wsnvk&#45;2543368571</title>\n",
       "<ellipse fill=\"green\" stroke=\"black\" cx=\"190\" cy=\"-18\" rx=\"50.09\" ry=\"18\"/>\n",
       "<text text-anchor=\"middle\" x=\"190\" y=\"-14.3\" font-family=\"Times,serif\" font-size=\"14.00\">evaluate</text>\n",
       "</g>\n",
       "<!-- fraud&#45;detection&#45;pipeline&#45;wsnvk&#45;3684411306&#45;&gt;fraud&#45;detection&#45;pipeline&#45;wsnvk&#45;1055580446 -->\n",
       "<g id=\"edge3\" class=\"edge\">\n",
       "<title>fraud&#45;detection&#45;pipeline&#45;wsnvk&#45;3684411306&#45;&gt;fraud&#45;detection&#45;pipeline&#45;wsnvk&#45;1055580446</title>\n",
       "<path fill=\"none\" stroke=\"black\" d=\"M110.8,-73.46C102.89,-64.82 92.86,-53.85 83.88,-44.03\"/>\n",
       "<polygon fill=\"black\" stroke=\"black\" points=\"86.3,-41.48 76.96,-36.46 81.13,-46.2 86.3,-41.48\"/>\n",
       "</g>\n",
       "<!-- fraud&#45;detection&#45;pipeline&#45;wsnvk&#45;3684411306&#45;&gt;fraud&#45;detection&#45;pipeline&#45;wsnvk&#45;2543368571 -->\n",
       "<g id=\"edge4\" class=\"edge\">\n",
       "<title>fraud&#45;detection&#45;pipeline&#45;wsnvk&#45;3684411306&#45;&gt;fraud&#45;detection&#45;pipeline&#45;wsnvk&#45;2543368571</title>\n",
       "<path fill=\"none\" stroke=\"black\" d=\"M139.43,-73.46C147.77,-64.48 158.45,-52.98 167.84,-42.87\"/>\n",
       "<polygon fill=\"black\" stroke=\"black\" points=\"170.47,-45.18 174.71,-35.47 165.34,-40.41 170.47,-45.18\"/>\n",
       "</g>\n",
       "</g>\n",
       "</svg>\n"
      ],
      "text/plain": [
       "<graphviz.graphs.Digraph at 0x7f562659c0d0>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<h2>Run Results</h2><h3>[info] Workflow 2ccacd0b-7bc6-42c6-86e7-7569d688b8cc finished, state=Succeeded</h3><br>click the hyper links below to see detailed results<br><table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>uid</th>\n",
       "      <th>start</th>\n",
       "      <th>state</th>\n",
       "      <th>name</th>\n",
       "      <th>parameters</th>\n",
       "      <th>results</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td><div title=\"285102f605634be38b8715f066ecf22e\"><a href=\"https://dashboard.default-tenant.app.llm2.iguazio-cd0.com/mlprojects/fraud-demo-pengw/jobs/monitor/285102f605634be38b8715f066ecf22e/overview\" target=\"_blank\" >...66ecf22e</a></div></td>\n",
       "      <td>Aug 10 00:25:34</td>\n",
       "      <td>completed</td>\n",
       "      <td>evaluate</td>\n",
       "      <td><div class=\"dictlist\">label_columns=label</div><div class=\"dictlist\">model=store://artifacts/fraud-demo-pengw/transaction_fraud_xgboost:2ccacd0b-7bc6-42c6-86e7-7569d688b8cc</div><div class=\"dictlist\">drop_columns=label</div></td>\n",
       "      <td><div class=\"dictlist\">evaluation_accuracy=0.9910044977511244</div><div class=\"dictlist\">evaluation_f1_score=0.30769230769230765</div><div class=\"dictlist\">evaluation_precision_score=0.36363636363636365</div><div class=\"dictlist\">evaluation_recall_score=0.26666666666666666</div></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td><div title=\"fbb56003835e4ef6b1978a9ded393874\"><a href=\"https://dashboard.default-tenant.app.llm2.iguazio-cd0.com/mlprojects/fraud-demo-pengw/jobs/monitor/fbb56003835e4ef6b1978a9ded393874/overview\" target=\"_blank\" >...ed393874</a></div></td>\n",
       "      <td>Aug 10 00:24:48</td>\n",
       "      <td>completed</td>\n",
       "      <td>train</td>\n",
       "      <td><div class=\"dictlist\">sample=-1</div><div class=\"dictlist\">label_column=label</div><div class=\"dictlist\">test_size=0.1</div></td>\n",
       "      <td><div class=\"dictlist\">best_iteration=2</div><div class=\"dictlist\">accuracy=0.9910044977511244</div><div class=\"dictlist\">f1_score=0.30769230769230765</div><div class=\"dictlist\">precision_score=0.36363636363636365</div><div class=\"dictlist\">recall_score=0.26666666666666666</div></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td><div title=\"530bcf86275e42f4901b2ce222cb6580\"><a href=\"https://dashboard.default-tenant.app.llm2.iguazio-cd0.com/mlprojects/fraud-demo-pengw/jobs/monitor/530bcf86275e42f4901b2ce222cb6580/overview\" target=\"_blank\" >...22cb6580</a></div></td>\n",
       "      <td>Aug 10 00:24:20</td>\n",
       "      <td>completed</td>\n",
       "      <td>feature-selection</td>\n",
       "      <td><div class=\"dictlist\">output_vector_name=short</div><div class=\"dictlist\">label_column=label</div><div class=\"dictlist\">k=18</div><div class=\"dictlist\">min_votes=2</div><div class=\"dictlist\">ignore_type_errors=True</div></td>\n",
       "      <td><div class=\"dictlist\">top_features_vector=store://feature-vectors/fraud-demo-pengw/short</div></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td><div title=\"da6210dcf8bf491193564b518c2ce2f0\"><a href=\"https://dashboard.default-tenant.app.llm2.iguazio-cd0.com/mlprojects/fraud-demo-pengw/jobs/monitor/da6210dcf8bf491193564b518c2ce2f0/overview\" target=\"_blank\" >...8c2ce2f0</a></div></td>\n",
       "      <td>Aug 10 00:23:55</td>\n",
       "      <td>completed</td>\n",
       "      <td>get-vector</td>\n",
       "      <td><div class=\"dictlist\">feature_vector=transactions-fraud</div><div class=\"dictlist\">features=[]</div><div class=\"dictlist\">label_feature=labels.label</div><div class=\"dictlist\">target={'name': 'parquet', 'kind': 'parquet'}</div><div class=\"dictlist\">update_stats=True</div></td>\n",
       "      <td><div class=\"dictlist\">feature_vector=transactions-fraud</div><div class=\"dictlist\">feature_vector_uri=store://feature-vectors/fraud-demo-pengw/transactions-fraud:latest</div></td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "run_id = project.run(\n",
    "    'main',\n",
    "    arguments={'vector_name':\"transactions-fraud\",\n",
    "                'label_column':\"labels.label\",\n",
    "              }, \n",
    "    dirty=True, watch=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![UI - WorkFlow](images/pipline-ui.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Test the model endpoint\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that your model is deployed using the pipeline, you can invoke it as usual:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 356,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "> 2023-08-08 22:00:01,010 [info] invoking function: {'method': 'POST', 'path': 'http://nuclio-fraud-demo-pengw-serving.default-tenant.svc.cluster.local:8080/v2/models/fraud/infer'}\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "{'id': 'db83101d-aa69-47a7-9056-089141ca0b54',\n",
       " 'model_name': 'fraud',\n",
       " 'outputs': [0]}"
      ]
     },
     "execution_count": 356,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Define your serving function\n",
    "serving_fn = project.get_function('serving')\n",
    "\n",
    "# Choose an id for your test\n",
    "sample_id = 'C1000148617'\n",
    "model_inference_path = '/v2/models/fraud/infer'\n",
    "\n",
    "# Send our sample ID for predcition\n",
    "serving_fn.invoke(path=model_inference_path,\n",
    "                  body={'inputs': [[sample_id]]})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Done!\n",
    "\n",
    "You've completed Part 4 of the model training with the feature store.\n",
    "Proceed to part 5 to learn how to deploy real-time application pipeline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "proof-reading",
   "language": "python",
   "name": "conda-env-.conda-proof-reading-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.16"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
